Existing vectorization techniques are ineffective for loops that exhibit little loop-level parallelism but some limited superword-level parallelism (SLP). We show that effectively vectorizing such loops requires partial vector operations, in which the degree of SIMD parallelism is smaller than the SIMD datapath width, to be executed correctly and efficiently. We present a simple yet effective SLP compiler technique called PAVER (PArtial VEctorizeR), formulated and implemented in LLVM as a generalization of the traditional SLP algorithm, to optimize such partially vectorizable loops. The key idea is to maximize SIMD utilization by widening the vector instructions used while minimizing the overheads caused by memory access, packing/unpacking, and/or masking operations, without introducing new memory errors or new numeric exceptions. For a set of 9 C/C++/Fortran applications with partial SIMD parallelism, PAVER achieves significantly better kernel and whole-program speedups than LLVM on both Intel's AVX and ARM's NEON.
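To make the notion of partial SIMD parallelism concrete, the following is a minimal hand-written sketch in C. It is not PAVER's algorithm or its generated code. The function names (`step_scalar`, `step_partial`), the 4-lane width, and the kernel itself are assumptions chosen for illustration. The loop carries a dependence from one iteration to the next, so it offers no loop-level parallelism, but each iteration contains three isomorphic statements that could share one 4-lane vector operation with a single padded lane.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumed SIMD width in floats (one 128-bit register) */
#define USED  3 /* isomorphic statements per iteration: x, y, z */

/* Scalar reference. Iteration i reads the result of iteration i-1, so the
   loop cannot be vectorized across i. */
void step_scalar(float *pos, const float *vel, float damp, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        pos[3 * i + 0] = pos[3 * (i - 1) + 0] * damp + vel[3 * i + 0];
        pos[3 * i + 1] = pos[3 * (i - 1) + 1] * damp + vel[3 * i + 1];
        pos[3 * i + 2] = pos[3 * (i - 1) + 2] * damp + vel[3 * i + 2];
    }
}

/* Partial-vector form, with a plain array standing in for a 4-lane register.
   Only USED lanes carry data; the last lane is padding. */
void step_partial(float *pos, const float *vel, float damp, size_t n)
{
    float acc[LANES] = {0.0f}; /* padded lane starts at 0 */

    if (n == 0)
        return;
    assert(pos != NULL && vel != NULL);

    /* Pack the first point into the accumulator lanes. */
    for (int l = 0; l < USED; l++)
        acc[l] = pos[l];

    for (size_t i = 1; i < n; i++) {
        float v[LANES] = {0.0f};

        /* Pack: read only the USED valid elements, so the last point never
           triggers a read past the end of vel. */
        for (int l = 0; l < USED; l++)
            v[l] = vel[3 * i + l];

        /* One full-width operation over all LANES lanes; the padded lane
           computes on padding and its result is never stored. */
        for (int l = 0; l < LANES; l++)
            acc[l] = acc[l] * damp + v[l];

        /* Masked store: write back only the USED valid lanes. */
        for (int l = 0; l < USED; l++)
            pos[3 * i + l] = acc[l];
    }
}
```

In this sketch the inner lane loops stand in for single vector instructions. The packing and masked-store steps correspond to the overheads the abstract mentions, and restricting loads and stores to the three valid lanes is one simple way to avoid introducing out-of-bounds accesses.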